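An Introduction to Preparing Private Data for RAG

The Foundations of RAG

Standard large language models (LLMs) are "frozen" in time, limited by the cutoff date of their training data. They cannot answer questions about your company's internal handbook or yesterday's private video meeting. Retrieval-augmented generation (RAG) bridges this gap by retrieving relevant information from your own private data and supplying it to the LLM as context.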

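The Multi-Step Workflow

To make private data "readable" to an LLM, we follow a specific pipeline:

Load: Convert various formats (such as PDFs, web pages, and YouTube videos) into a standard document format.
Split: Break long documents into smaller, manageable "chunks."
Embed: Convert each text chunk into a numeric vector (a mathematical representation of its meaning).
Store: Save those vectors in a vector database (such as Chroma) so that similarity search is fast.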
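
The four steps above can be sketched end to end in plain Python. This is a toy under stated assumptions: `load`, `split`, `embed`, and `ToyVectorStore` are hypothetical names, the word-count "embedding" stands in for a real embedding model, and the in-memory list stands in for a vector database such as Chroma. It is not the API of any library.

```python
import math
import re
from collections import Counter

def load(sources):
    """Load: wrap raw strings as documents with text plus metadata."""
    return [{"text": text, "metadata": {"source": name}}
            for name, text in sources.items()]

def split(doc, chunk_size=80):
    """Split: cut a document into fixed-size character chunks."""
    text = doc["text"]
    return [{"text": text[i:i + chunk_size], "metadata": doc["metadata"]}
            for i in range(0, len(text), chunk_size)]

def embed(text):
    """Embed: toy word-count vector (a real system would call an embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

class ToyVectorStore:
    """Store: keep (vector, chunk) pairs in memory and rank them by similarity."""
    def __init__(self):
        self.rows = []

    def add(self, chunks):
        for chunk in chunks:
            self.rows.append((embed(chunk["text"]), chunk))

    def search(self, query, k=1):
        q = embed(query)
        ranked = sorted(self.rows, key=lambda row: cosine(q, row[0]), reverse=True)
        return [chunk for _, chunk in ranked[:k]]

# Hypothetical sample data for illustration only.
sources = {
    "handbook.txt": "Employees may work remotely up to three days per week.",
    "meeting.txt": "The launch date moved to the first week of March.",
}
store = ToyVectorStore()
for doc in load(sources):
    store.add(split(doc))

top = store.search("When is the launch date?", k=1)[0]
print(top["metadata"]["source"], "->", top["text"])
```

The retrieved chunk, not the whole corpus, is what would be passed to the LLM as context. Swapping in a real embedding model and vector database changes `embed` and the store, while the load, split, embed, store order stays the same.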

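Why Chunking Matters

An LLM has a "context window," the maximum amount of text it can process at once. If you send a 100-page PDF in full, it may exceed that limit and the model cannot process it. Chunking ensures that only the most relevant passages are sent to the model.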
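
As a rough illustration of chunking, the sketch below splits text into fixed-size pieces and repeats a few characters between neighbors. The function name `split_with_overlap` is hypothetical, and sizes are measured in characters for simplicity; real splitters may count tokens and prefer to break at natural boundaries instead.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Split text into chunks of at most chunk_size characters.

    Each chunk after the first starts chunk_overlap characters before the
    previous chunk ended, so text near a boundary appears in both chunks.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks

text = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(text, chunk_size=10, chunk_overlap=3)
print(chunks)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

With an overlap of 3, the tail "hij" of the first chunk reappears at the head of the second, so a sentence that straddles the boundary is not lost to either chunk.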

Question 1

Why is chunk_overlap considered a critical parameter when splitting documents for RAG?

To reduce the total number of tokens used by the LLM.

To ensure that semantic context (the meaning of a thought) is not cut off at the end of a chunk.

To make the vector database store data faster.

Challenge: Preserving Context

Apply your knowledge to a real-world scenario.

You are loading a YouTube transcript for a technical lecture. You notice that the search results are confusing "Lecture 1" content with "Lecture 2."

Task

Which splitter would be best for keeping context like "Section Headers" intact?

Solution:
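MarkdownHeaderTextSplitter is the better fit when the transcript carries header lines: it splits at each header and records the header text in every chunk's metadata, which helps the retrieval system distinguish between different chapters or lectures. RecursiveCharacterTextSplitter also helps, because it tries paragraph and line breaks before falling back to smaller separators, but it does not add header metadata on its own.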
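
To show why header metadata helps, here is a simplified stand-in for header-aware splitting, written in plain Python. It is not the library's implementation: `split_by_headers` is a hypothetical helper, and it assumes the transcript has already been given Markdown-style `# ` header lines, which raw YouTube transcripts usually lack.

```python
def split_by_headers(text, header_prefix="# "):
    """Split text at header lines and tag each chunk with its header."""
    chunks = []
    current_header = None
    buffer = []

    def flush():
        # Emit the buffered body (if any) under the header seen most recently.
        body = "\n".join(buffer).strip()
        if body:
            chunks.append({"text": body, "metadata": {"header": current_header}})
        buffer.clear()

    for line in text.splitlines():
        if line.startswith(header_prefix):
            flush()
            current_header = line[len(header_prefix):].strip()
        else:
            buffer.append(line)
    flush()
    return chunks

# Hypothetical transcript with headers added by hand.
transcript = """# Lecture 1
Vectors represent meaning as numbers.
# Lecture 2
Chunk overlap preserves context across boundaries."""

chunks = split_by_headers(transcript)
lecture_2 = [c for c in chunks if c["metadata"]["header"] == "Lecture 2"]
print(lecture_2[0]["text"])
```

Because every chunk carries its header, a retriever can filter or re-rank by `metadata["header"]` and stop mixing "Lecture 1" results into "Lecture 2" queries.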